{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# word2vec as Matrix Factorization\n",
    "\n",
    "Let's start with a simple training set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "train = ['The student speaks, as the student wants to learn. We learn what the student wants.']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# (Later)\n",
    "# There are plenty of books freely available online!\n",
    "# The Adventures of Sherlock Holmes by Arthur Conan Doyle\n",
    "# http://www.gutenberg.org/ebooks/1661\n",
    "'''!wget http://www.gutenberg.org/files/1661/1661-0.txt\n",
    "with open('1661-0.txt') as f:\n",
    "    train = [f.read()]'''"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# (Even later)\n",
    "# Bigger dataset of 100 MB (17M words)\n",
    "'''!wget http://mattmahoney.net/dc/text8.zip\n",
    "!unzip text8.zip\n",
    "with open('text8') as f:\n",
    "    train = [f.read()]'''"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(15, 'words')"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(train[0].split()), 'words'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, let's standardize the text (lowercase, remove punctuation, etc.)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['the', 'student', 'speaks', 'as', 'the', 'student', 'wants', 'to', 'learn', 'we', 'learn', 'what', 'the', 'student', 'wants']\n",
      "CPU times: user 1.12 s, sys: 998 ms, total: 2.12 s\n",
      "Wall time: 391 ms\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "from sklearn.feature_extraction.text import CountVectorizer\n",
    "\n",
    "transformer = CountVectorizer()\n",
    "transformer.fit(train)\n",
    "analyzer = transformer.build_analyzer()\n",
    "\n",
    "tokens = analyzer(train[0])\n",
    "print(tokens)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'the': 4,\n",
       " 'student': 3,\n",
       " 'speaks': 2,\n",
       " 'as': 0,\n",
       " 'wants': 6,\n",
       " 'to': 5,\n",
       " 'learn': 1,\n",
       " 'we': 7,\n",
       " 'what': 8}"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "encoder = transformer.vocabulary_\n",
    "encoder"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['as', 'learn', 'speaks', 'student', 'the', 'to', 'wants', 'we', 'what']"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "decoder = transformer.get_feature_names()\n",
    "decoder"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The context is a window of size `WINDOW_SIZE` around each word of the corpus.\n",
    "\n",
    "If the corpus if $w_0, \\ldots, w_{n - 1}$, the context of a word $w_i$ is all words $w_{i - L}, w_{i - L + 1}, \\ldots, w_{i + L}$ where $L$ represents the `WINDOW_SIZE`.\n",
    "\n",
    "Write a piece of code that builds a **word**-context count matrix. Be careful of corner cases.\n",
    "\n",
    "**The** student $\\rightarrow$ *student* is a context of ***The***, so we should increment that word-context pair  \n",
    "The **student** speaks $\\rightarrow$ *The* and *speaks* are contexts of ***student***  \n",
    "student **speaks** as"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "from collections import Counter\n",
    "\n",
    "WINDOW_SIZE = 1  # Should be 1 as a start, then 5 for bigger corpuses\n",
    "counts = Counter()  # Should contain the number of word-context occurrences (keys are pairs)\n",
    "nb_word = Counter()  # Should contain the number of occurrences of each word\n",
    "nb_context = Counter()  # Should contain the number of occurrences of each context\n",
    "\n",
    "for pos, word in enumerate(tokens):\n",
    "    # Your code here\n",
    "    pass\n",
    "# Check counts"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will now build a word-context PMI matrix (*pointwise mutual information*), empirically given by:\n",
    "\n",
    "$$ PMI(w, c) = \\log \\frac{P(w, c)}{P(w)P(c)} = \\log \\frac{\\#(w, c) |D|}{\\#(w) \\#(c)} $$\n",
    "\n",
    "where $|D|$ is the number of word-context pairs in the corpus, $\\#(w), \\#(c), \\#(w, c)$ are respectively the number of occurrences of word, context and word-context pair.\n",
    "\n",
    "This matrix will be sparse: please populate `rows`, `cols`, and `data` lists, for word indices (using `encoder` defined as vocabulary), contexts, and counts."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "import numpy as np\n",
    "\n",
    "rows = []  # Contains words indices\n",
    "cols = []  # Contains context indices\n",
    "data = []  # Contains values of the matrix\n",
    "\n",
    "for (word, context), count in counts.items():\n",
    "    # Your code here\n",
    "    pass"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from scipy.sparse import csr_matrix\n",
    "from scipy.sparse.linalg import svds\n",
    "\n",
    "pmi = csr_matrix((data, (rows, cols)), shape=(len(nb_word), len(nb_context)))\n",
    "pmi"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We are going to compute the SVD of this matrix to reduce the dimensionality.\n",
    "\n",
    "The (compact) singular value decomposition of a $m \\times n$ matrix $M$ of rank $r$ is $U \\Sigma V^T$ where:\n",
    "\n",
    "- $U$ is semi-unitary and of size $m \\times r$\n",
    "- $\\Sigma$ is diagonal, $r \\times r$\n",
    "- $V^T$ is semi-unitary and of size $r \\times n$, i.e. $U^T U = V^T V = I_{r \\times r}$.\n",
    "\n",
    "We usually order the singular values in decreasing order, to explain as much variance as possible ($k$-SVD is the best approximation of rank $k$)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "u, sigma, vt = svds(pmi, k=3)\n",
    "# 1 min 55 s on text8"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pmi.min(), pmi.max()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "embeddings = u * sigma\n",
    "embeddings.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "u.shape, sigma.shape, vt.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Word similarity"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's play with embeddings!\n",
    "\n",
    "Implement the cosine similarity:\n",
    "\n",
    "$$ cos(u, v) = \\frac{\\langle u, v \\rangle}{|| u ||_2 || v ||_2} $$\n",
    "\n",
    "then check that you have the same results as `sklearn`.\n",
    "\n",
    "It is now the moment to move to the Sherlock Holmes corpus. Go back to first cell!\n",
    "\n",
    "Pick a few words in the vocabulary (`encoder`) and compute their 20 closest neighbors. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics.pairwise import cosine_similarity\n",
    "\n",
    "# Your code here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Actually, the similarity values have noise. Let's remove the negative values from the PMI matrix.\n",
    "\n",
    "> *When  representing  words,  there  is  some  intuition  behind  ignoring  negative  values:  humans  can easily think of positive associations (e.g.  “Canada” and “snow”) but find it much harder to invent negative ones (“Canada” and “desert”).  This suggests that the perceived similarity of two words is more influenced by the positive context they share than by the negative context they share.  It therefore makes some intuitive sense to discard the negatively associated contexts and mark them as “uninformative” (0) instead*\n",
    "\n",
    "It is now a PPMI matrix: positive pointwise mutual information."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ppmi = pmi.copy()\n",
    "ppmi.data[ppmi.data < 0] = 0\n",
    "ppmi.eliminate_zeros()  # Remove the now-zero values\n",
    "ppmi\n",
    "# Now recompute the SVD with ppmi"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Semantic analogies"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now **let's replace logic with algebra.**\n",
    "\n",
    "(Not everyone will be satisfied with this statement, I guess.)\n",
    "\n",
    "Recompute the PPMI matrix and embeddings for the bigger dataset, `text8`.\n",
    "\n",
    "Then attempt to answer some questions where we have to find $b^*$ in:\n",
    "\n",
    "$$ a \\textrm{ is to } a^* \\textrm{ as } b \\textrm{ is to } b^* $$\n",
    "\n",
    "(ex. *Paris is to France as Tokyo is to Japan*)\n",
    "\n",
    "How to express this in terms of embeddings?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Once text8 has been trained\n",
    "# encoder['paris'], encoder['france'], encoder['tokyo'], encoder['japan']\n",
    "# a a* b (b*?)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Does it work? You may want to normalize the embeddings. Please do so into `embed_unit`.\n",
    "\n",
    "Please look for nasty analogies (shortcuts that are unfair)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Your code here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Please note that it is possible to reuse your factorization method to learn the embeddings. This is the topic of a future homework!\n",
    "\n",
    "# References\n",
    "\n",
    "Great reads!\n",
    "\n",
    "Levy, O., & Goldberg, Y. (2014). [Neural word embedding as implicit matrix factorization.](https://papers.nips.cc/paper/5477-neural-word-embedding-as-implicit-matrix-factorization.pdf) In Advances in neural information processing systems (pp. 2177–2185).\n",
    "\n",
    "Doyle, A. C. (1891). [The Adventures of Sherlock Holmes: Adventure I. — A Scandal in Bohemia.](http://www.gutenberg.org/ebooks/1661) The Strand Magazine, vol. 2, pp. 61–75 (July 1891). "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
